Abstract

In 1996, shortly after the founding of The Cochrane Collaboration, leading figures in test evaluation research 
established a Methods Group to focus on the relatively new and rapidly evolving methods for the systematic 
review of studies of diagnostic tests. Seven years later, the Collaboration decided it was time to develop a 
publication format and methodology for Diagnostic Test Accuracy (DTA) reviews, as well as the software needed 
to implement these reviews in The Cochrane Library. A meeting hosted by the German Cochrane Centre in 2004 
brought together key methodologists in the area, many of whom became closely involved in the subsequent 
development of the methodological framework for DTA reviews. DTA reviews first appeared in The Cochrane 
Library in 2008 and are now an integral part of the work of the Collaboration. 



Background 

Finding good evidence regarding the performance of 
diagnostic tests and interpreting its value for practice is 
more challenging and less straightforward than for inter- 
ventions. Most diagnostic studies focus on diagnostic test 
accuracy, which expresses a test's ability to discriminate
between people with the target condition and those 
without it [see Additional file 1]. However, estimates of 
test accuracy often vary markedly between studies. Such 
heterogeneity may reflect differences between studies 
in the criterion used to define test positivity, study 
design and patient characteristics as well as the place 
of the test in the diagnostic pathway [1-3]. Furthermore, 
a highly accurate test does not necessarily improve a 
patient's outcome [4]. Systematic reviews of diagnostic
test accuracy summarize the evidence about test accuracy. 
Ideally, they also investigate why the results may vary 
among studies, compare the performance of alternative 
tests, and help the reader to put the evidence in a clinical 
context [5,6]. 
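As a minimal illustration of what test accuracy expresses, the sketch below computes the usual paired measures from one study's 2x2 table of index test results against the reference standard. The function name and the counts are hypothetical and are not taken from any study cited here.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Paired accuracy measures from a single study's 2x2 table.

    tp, fn: people WITH the target condition who test positive / negative
    fp, tn: people WITHOUT the target condition who test positive / negative
    Assumes no zero denominators (e.g. specificity is neither 0 nor 1).
    """
    sensitivity = tp / (tp + fn)   # proportion of diseased correctly detected
    specificity = tn / (tn + fp)   # proportion of non-diseased correctly cleared
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),     # negative predictive value
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# Hypothetical counts, for illustration only.
example = accuracy_measures(tp=90, fp=20, fn=10, tn=80)
```

Note that the predictive values, unlike sensitivity and specificity, depend directly on how common the target condition is in the study sample.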

In the early 1990s, several researchers led by Les Irwig 
and Paul Glasziou were working on methods for the 
systematic review of diagnostic test accuracy and identified 
The Cochrane Collaboration as an obvious place where 
health professionals looking for evidence about diagnostic 
tests should be able to go. After an initial meeting at 
the 2nd Cochrane Colloquium in Hamilton, Ontario on
2 October 1994, the Cochrane Screening and Diagnostic 
Test Methods Group was founded and formally registered 
in the Collaboration in 1996. It initially focused on 
identifying a common method for preparing diagnostic 
test accuracy reviews. 

One of their goals was to include diagnostic test accuracy 
(DTA) reviews in The Cochrane Library. However, largely 
because of the limited resources available, the Steering 
Group of The Cochrane Collaboration decided that, in 
1996, the Collaboration was not ready to include such a 
methodologically challenging review type. Seven years later, 
in 2003, Jon Deeks and Constantine Gatsonis persuaded 
the Collaboration to revisit the question of inclusion of 
DTA reviews. The Cochrane Collaboration was then ten 
years old and had proven its value for decisions about 
interventions, and important advances had been made 
on the methodology for diagnostic test accuracy reviews. 
The Collaboration decided that the time was right to plan 
for the inclusion of systematic reviews of diagnostic test 
accuracy studies in The Cochrane Library. A Cochrane 
Diagnostic Reviews Working Group, led by Jon Deeks, 
Constantine Gatsonis and Patrick Bossuyt and comprising 
members of the Methods Group, software experts, editors of 
Cochrane Review Groups and interested authors, was 
established to plan and undertake the work required 
for the Collaboration to deliver these reviews [see 
Additional file 2]. 

The first step involved achieving consensus on a core 
method. The following year, the proposers of the Bayes 
Library (led by Matthias Egger and Daniel Pewsner), 
members of the Cochrane Screening and Diagnostic Test 
Methods Group, and other international experts met 
in Freiburg, Germany, to discuss and agree on 
appropriate methods for each step in a meta-analysis of 
diagnostic test accuracy, including graphical displays. 
The Bayes Library proposal was radically different in that 
it considered producing a database of meta-analytical 
estimates of likelihood ratios and pre-test probabilities, 
which could be used for probability revision in Bayesian 
diagnostic thinking. After debate, consensus was reached 
on following a more standard methodology that utilised 
sensitivity and specificity estimates. Following the meeting, 
members of the Cochrane Screening and Diagnostic 
Test Methods Group assisted the Collaboration's Information 
Management Team with the development of a version 
of the Collaboration's Review Manager software that included 
the functions necessary for DTA reviews, and worked with 
the Collaboration's publisher to develop a publication 
format. The software for intervention reviews can itself 
calculate and display the results of meta-analyses of the 
included studies. For DTA reviews, a different approach 
was taken: the Collaboration's software was linked with 
commercial statistical software packages that contained 
the functionality necessary to fit the complex hierarchical 
statistical models for meta-analysis. 

The Cochrane Library was ready to register titles for 
diagnostic test accuracy reviews in October 2007, with 
the publication of the first Cochrane diagnostic test 
accuracy review in October 2008 [7]. During this period, 
members of the Cochrane Screening and Diagnostic Test 
Methods Group worked not only on the development of 
the above-mentioned methods, but also on the development 
of pilot reviews and guidance in the form of a 
Handbook. Support Units were established in the United 
Kingdom and The Netherlands to assist the Cochrane 
Review Groups with publication preparation and processes 
surrounding these reviews; a website was launched, train- 
ing workshops were provided and a separate Editorial 
Team was established to oversee DTA reviews [8]. 

In the following sections, we highlight some of the 
methodological developments in diagnostic systematic 
reviews that took place from the early 1990s until now, 
against the background of the history outlined above. 
Current challenges and possible solutions for them are 
discussed, and we conclude with an overview of the 
current status of these reviews within The Cochrane 
Collaboration. 

Early methodology 

The first meta-analyses of diagnostic test accuracy were 
published in the late 1980s and early 1990s and largely 
followed the approaches used for intervention meta-analyses: 
retrieval and selection of studies, assessing their 
quality, summarizing their results in a meta-analysis, 
investigating heterogeneity and drawing conclusions (for 
example, [9,10]). However, meta-analysis of diagnostic 
test accuracy was intrinsically more complex because test 
accuracy measures usually come in pairs: sensitivity and 
specificity; positive and negative predictive values; and 
positive and negative likelihood ratios. A key consideration 
is that accuracy measures depend on the threshold that 
is used to define a positive test result. Sensitivity and 
specificity, which are commonly reported, vary in opposite 
directions as the threshold changes. An early regression-based 
method that did take this into account was not 
straightforward to fit [10]. Another approach used the 
area under the receiver operating characteristic (ROC) 
curve to provide a single summary measure of accuracy 
per study, thus losing information about threshold effects 
[11]. A major breakthrough in the meta-analysis of diag- 
nostic test accuracy was the publication of the statistical 
method developed by Moses, Littenberg and colleagues, 
which was straightforward to implement and also took 
the threshold effect into account [12,13]. This method 
was widely adopted in subsequent reviews. 
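The article does not give formulas for this method, so the following is only a sketch of the Moses and Littenberg approach as it is commonly described: each study's sensitivity and false positive rate are logit-transformed, their difference D (the log diagnostic odds ratio) is regressed on their sum S (a proxy for threshold), and the fitted line is transformed back into a summary ROC curve. The unweighted least-squares fit, the 0.5 continuity correction, the function names and the study counts are all assumptions made for illustration.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def d_and_s(tp, fn, fp, tn):
    """Return (D, S) for one study, adding 0.5 to every cell (assumed correction)."""
    tpr = (tp + 0.5) / (tp + fn + 1.0)   # sensitivity
    fpr = (fp + 0.5) / (fp + tn + 1.0)   # 1 - specificity
    return logit(tpr) - logit(fpr), logit(tpr) + logit(fpr)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_mean - b * x_mean, b


def moses_littenberg(studies):
    """Fit D = a + b * S across studies given as (tp, fn, fp, tn) tuples."""
    pairs = [d_and_s(*study) for study in studies]
    return fit_line([s for _, s in pairs], [d for d, _ in pairs])


def sroc_sensitivity(a, b, fpr):
    """Sensitivity on the summary ROC curve at a given false positive rate.

    Solving D = a + b * S for logit(TPR) gives
    logit(TPR) = a / (1 - b) + (1 + b) / (1 - b) * logit(FPR).
    """
    lt = a / (1 - b) + (1 + b) / (1 - b) * logit(fpr)
    return 1 / (1 + math.exp(-lt))


# Hypothetical studies (tp, fn, fp, tn), for illustration only.
studies = [(90, 10, 20, 80), (80, 20, 10, 90), (70, 30, 5, 95)]
a, b = moses_littenberg(studies)
curve_point = sroc_sensitivity(a, b, fpr=0.10)
```

A slope b close to zero suggests that the diagnostic odds ratio does not change with threshold, in which case the summary ROC curve is symmetric.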

The complexity of DTA reviews is not restricted to 
statistical methods. Even formulating the review question 
may not be straightforward because the accuracy of a test 
can vary in different situations. For instance, study design 
may affect estimated accuracy, and there is no 'best' design 
analogous to the use of the randomized trial to compare 
interventions. Furthermore, there is no standard ter- 
minology to describe the variety of study designs used 
to assess accuracy. Consequently, it is more difficult to 
retrieve relevant studies from electronic databases and 
the selection process is more complex. Interpretation 
of summary estimates from a DTA review also requires 
careful consideration because a highly accurate test in 
itself will not improve the patient's outcome. It is the 
management of the patient and decisions made after 
the test is administered that directly influence the 
patient's wellbeing. These more epidemiological issues 
and considerations for the meta-analysis of test accuracy 
studies were published in parallel with the statistical 
developments [5,14]. After almost 20 years, these 
guidelines [5] are still very relevant and current. 

Recent developments 

At the time that the Cochrane Collaboration Steering 
Group decided that it would consider diagnostic test 
accuracy reviews, it appeared that the methods for these 
reviews were well defined [15,16] and all that remained 
was to reach consensus about which methods to adopt. 
However, as the discussions progressed, limitations of 
existing commonly used approaches became clear, and 
ideas for alternative methods and further developments 
were generated. These are outlined below. 
Question formulation and the interpretation of results

There was an increasing awareness that because tests 
are used in a range of contexts, their value very much 
depends on their place and role in clinical practice [17]. 
This also affects the interpretation and applicability of 
the findings: do the findings hold in all situations, or 
do different situations cause the test to behave differently? 
For example, questionnaires to determine whether elderly 
patients are developing dementia may be of value in 
general practice. However, when such a questionnaire 
is used in a mental health clinic where patients have 
many symptoms in common, the questionnaire is 
no longer able to distinguish between someone with 
general mental impairment and someone with dementia. 

Even if such a questionnaire could distinguish very 
well between people with general cognitive impairment 
and those with dementia, its value may still depend 
on other factors such as whether the knowledge that 
someone has dementia rather than general cognitive 
impairment will affect their outcomes and quality of 
life. The potential consequences of a positive or negative 
test result should be taken into account when interpreting 
the results of a DTA review. If knowledge of the test result 
does not affect further management, the value of testing 
at that point may be very limited. 

When formulating the review question, one should 
also realize that diagnostic tests are not used in isolation 
and that alternatives should be considered as well. 
Therefore, Cochrane DTA reviews have also turned 
their focus to comparative accuracy, 
because choosing a test requires robust information about 
the value it adds compared to existing alternatives. 

Search and selection 

Studies of the relative effects of different interventions 
are relatively easy to find by searching for randomized 
trials. Searching for studies of diagnostic test accuracy is 
far more difficult because the study designs vary and there 
is no one term that can be used to filter all diagnostic 
studies. Multiple combinations of methodological terms 
have been tried, resulting in the development of so-called 
'methodological search filters'. However, it has become 
clear that searching for diagnostic accuracy studies 
involves more than filtering studies for their use of 
diagnosis-related terms [18,19]. As a result, review authors 
are often forced to screen thousands of retrieved article 
titles in order to find a relatively small number of poten- 
tially relevant studies. 

Quality assessment 

The first empirical investigation of the effect 
of a range of potential biases on diagnostic accuracy 
outcomes was published in 2002 [20]. An overview of all 
potential sources of bias and variation was published two 
years later and formed the basis of a Quality Assessment 
for Diagnostic Accuracy Studies (QUADAS) tool [21,22]. 
This tool consisted of 14 items and has been widely 
used by authors of diagnostic test accuracy reviews. A 
modified form of QUADAS became the recommended 
quality assessment tool for Cochrane diagnostic accuracy 
reviews [23]. 

As the tool became more widely used, it became 
apparent that it had some drawbacks such as not 
distinguishing adequately between true biases and reporting 
biases, and also not distinguishing between risk of bias 
and issues of applicability or representativeness. In response 
to these limitations, an updated version of the tool was 
developed and published in 2011 [24]. This version, 
which is now used for Cochrane DTA reviews, allows the 
assessment of both risk of bias and concerns regarding 
applicability in an explicit and transparent way. 

Meta-analysis 

As outlined above, the statistical approach developed by 
Moses and Littenberg was widely adopted as it was 
straightforward to apply and understand. Alternative, but 
substantially more complex, statistical approaches were 
published in the mid-1990s, providing a framework for 
more rigorous methods that take proper account of 
within-study variability in sensitivity and specificity, and 
unexplained heterogeneity in test accuracy between studies 
[25,26]. These more rigorous methods are the basis 
for the hierarchical models that are recommended for 
Cochrane DTA reviews and that are increasingly used in 
preference to the original Moses and Littenberg method. 

Both of these hierarchical models use an estimate of 
test sensitivity and specificity for each study. The first 
model, commonly referred to as the Rutter and Gatsonis 
Hierarchical Summary ROC (HSROC) model, focuses on 
the estimation of a summary ROC curve that allows for 
threshold effects (Figure 1A) [27]. A modification of this 
approach that allows the model to be fitted in SAS software 
was identified, which has facilitated its adoption [28]. A second model, 
commonly referred to as the bivariate model, performs a 
joint meta-analysis of logit-transformed sensitivity and 
specificity, allowing for correlation between them across 
studies, with the aim of obtaining a summary estimate for 
both sensitivity and specificity (Figure 1B) [29]. Further 
work on these models has demonstrated that they are 
mathematically equivalent, but the different parameter- 
isations affect the interpretation of covariates included 
in the models [30,31]. 
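The article does not write out the bivariate model, so the following is a sketch of its between-study level in one commonly used notation. The symbols are assumptions introduced here for illustration and are not taken from [29]. For study i:

```latex
\begin{pmatrix}
  \operatorname{logit}(\mathrm{sens}_i) \\
  \operatorname{logit}(\mathrm{spec}_i)
\end{pmatrix}
\sim N\!\left(
  \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},\;
  \Sigma
\right),
\qquad
\Sigma =
\begin{pmatrix}
  \sigma_A^2 & \sigma_{AB} \\
  \sigma_{AB} & \sigma_B^2
\end{pmatrix}
```

Here the summary sensitivity and specificity are obtained by back-transforming \mu_A and \mu_B from the logit scale, and \sigma_{AB} captures the correlation between sensitivity and specificity across studies. Within-study sampling variability is handled at a separate, lower level of the model.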

Figure 1 Summary receiver-operating characteristic (ROC) plots showing test accuracy of cytology for detecting primary bladder 
cancer [32]. A) The summary ROC curve, representing the underlying relationship between sensitivity and specificity for the test across varying 
thresholds. B) The summary sensitivity and specificity and a 95% confidence region around it. The smaller oval shaped symbols in both graphs 
show the individual study results, with the height of the symbol representing the number of diseased individuals and the width of the ovals 
representing the number of non-diseased individuals. 

The Rutter and Gatsonis (HSROC) model assumes that 
each test is subject to a threshold effect, either explicitly 
by applying a different cut-point in the case of continuous 
test results, or implicitly as occurs in imaging studies. 
Under the HSROC model, threshold effects between 
studies are accounted for by a proxy measure for threshold 
that is based on an underlying test positivity rate in each 
study. If thresholds vary between studies, estimating one 
overall summary pair of sensitivity and specificity is not 
appropriate or readily interpretable because the sensitivity 
and specificity will vary by threshold. The bivariate 
model adopted by Reitsma and colleagues focuses on the 
estimation of a summary pair of sensitivity and specificity 
on the basis that clinicians require this information to 
assess the consequences of decisions made after a test 
result is known. Clearly, this approach requires that the 
study-specific estimates of sensitivity and specificity for 
a test are obtained using a common criterion (threshold) 
for test positivity for the summary estimates to have a 
clear interpretation. Because of these considerations, 
review authors are advised to think carefully about the 
questions they aim to address in their review and the 
type of test they are analyzing to guide their choice of 
model [33]. 
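For reference, the HSROC model discussed above is usually written along the following lines. The notation is an assumption made here for illustration, since the article itself gives no formulas. For study i and disease group j:

```latex
y_{ij} \sim \mathrm{Binomial}(n_{ij}, \pi_{ij}),
\qquad
\operatorname{logit}(\pi_{ij}) = (\theta_i + \alpha_i X_{ij}) \exp(-\beta X_{ij}),
\qquad
\theta_i \sim N(\Theta, \sigma_\theta^2),
\quad
\alpha_i \sim N(\Lambda, \sigma_\alpha^2)
```

Here y_{ij} is the number of positive test results among the n_{ij} people in group j, X_{ij} is coded -0.5 for people without the target condition and +0.5 for those with it, \theta_i is the positivity (threshold proxy) parameter, \alpha_i is the accuracy parameter and \beta allows the summary ROC curve to be asymmetric. In this form, variation in \theta_i moves studies along the curve, whereas variation in \alpha_i moves them away from it.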

Future developments 

With most of the basic methods now developed and 
available as guidance for review authors [6,8], it is time to 
consider future directions. Some ongoing developments 
may make the process of preparing a systematic review of 
diagnostic test accuracy easier, but other developments 
may lead to greater complexity. 

Search and selection 

Developments in text mining and machine learning 
techniques may make the search and selection of studies 
an easier task. These techniques may help in developing 
search strategies, but their biggest advantage will probably 
be in the stages of study selection. The software can be 
trained to distinguish relevant studies from irrelevant 
studies, allowing automatic filtering out of the clearly 
non-relevant studies at the first selection stage. The 
techniques may also be used in place of a second or 
third reviewer, being more objective and perhaps also 
more consistent than a human reviewer. This could 
facilitate the handling of disagreements in the selection 
stage. 

Publication bias 

In diagnostic research, not much is known about the 
'drivers' behind publication bias. A diagnostic accuracy 
study usually does not test a hypothesis, so there is 
no P value to influence the publication decisions of 
authors and publishers on the basis of the statistical 
significance of the results. Investigating what drives the 
publication of a diagnostic study is difficult because no 
formal registration of these studies exists, and because 
these studies may also be done on an ad hoc basis using 
pre-existing data or samples. In the light of current 
efforts to ensure the publication of 
every trial ever done (see www.alltrials.net), it would be 
good to set similar standards for accuracy studies. Until 
then, we should urge review authors to put extra effort 
into finding unpublished, as well as published diagnostic 
test accuracy studies. This will also help to identify factors 
associated with non-publication, thereby informing the 
further development of approaches for assessing potential 
publication bias [34,35]. 

Meta-analysis 

In terms of statistical methods, future developments are 
likely to reflect the increasing interest in comparative 
accuracy of tests. Alternative tests are generally available; 
hence, it is appropriate to evaluate the accuracy of a test 
not in isolation, but relative to relevant alternative tests. 
Unfortunately, studies that directly compare tests are not 
common and meta-analyses to compare tests must often 
rely on a set of studies that evaluated one of the tests 
(test A) and a different set of studies that have evaluated 
the alternative test (test B). This indirect approach would 
not be acceptable in a systematic review to compare 
the effectiveness of two interventions, but is common 
practice when comparing tests because of the limitations 
of available data. Nevertheless, developments in the 
area of indirect comparisons and multiple treatment 
comparison meta-analyses for intervention studies may 
help to guide future methodological developments for 
DTA comparative meta-analyses [36]. At present, the 
routinely used models for DTA meta-analysis utilise 
data on a single sensitivity and specificity pair for each 
study. Hence, current models do not fully utilise all of 
the available data. Some progress has been made in this 
area [37], but more general and robust methods are 
required. 

Interpretation and summary of findings 

A major focus of DTA reviews is to obtain summary 
estimates of test accuracy. However, knowing that a test 
has a high sensitivity, for instance, does not tell us 
whether the test will have much impact on the patient, 
nor does it tell us that using this test in practice will be 
beneficial for the patient, or cost-effective. Improved 
accuracy is not even necessary for patient benefit to 
occur because new tests may improve outcomes if they 
can be used on a wider patient group, are less invasive, 
or allow time-critical effective therapy to be given earlier 
[38]. Although a GRADE approach for diagnostic tests 
has now been developed, providing guidance on how to 
translate accuracy data into a recommendation involving 
patient-important outcomes requires much more 
consideration [39]. 

Conclusions 

Preparing a diagnostic test accuracy review is likely to be 
very time consuming and challenging. The challenges 
start at the point of question formulation. Most chapters 
of the Cochrane Handbook for Diagnostic Test Accuracy 
Reviews have been published and software is available to 
facilitate the review process and meta-analysis. In April 
2013, the titles for around Cochrane DTA reviews have 
been registered. With 13 published reviews and 61 
published protocols in Issue 4, 2013 of The Cochrane 
Library, the DTA reviews are now an established part 
of the Library and may serve as an example for the 
inclusion of future new review types. 